Abstract 

Processed pseudogenes are copies of messenger RNAs that have been reverse transcribed into DNA and inserted 
into the genome using the enzymatic activities of active L1 elements. Processed pseudogenes generally lack 
introns, end in a 3' poly A, and are flanked by target site duplications. Until recently, very few polymorphic 
processed pseudogenes had been discovered in mammalian genomes. Now several studies have found a number 
of polymorphic processed pseudogenes in humans. Moreover, processed pseudogenes can occur in somatic cells, 
including in various cancers and in early fetal development. One recent somatic insertion of a processed 
pseudogene has caused a Mendelian X-linked disease, chronic granulomatous disease. 
Background 

Pseudogenes are sequences present in essentially all ani- 
mal genomes that have many characteristics of genes, 
but are defective for production of protein. Of course, 
like most definitions that are 30 years old and based on 
incomplete information, this one has also been modified. 
We now know of many pseudogenes that are active in 
making proteins. Of the more than 14,000 pseudogenes 
in the human genome [1], at least 10% are no longer 
'pseudogenes' and are active [1,2]. Many active 'pseudo- 
genes' are gene duplicates that contain introns and are 
situated in close proximity to their active gene copies. 
These gene duplicates make up one class of pseudogenes. 
An interesting example of a duplicate pseudogene 
is the ψζ gene in the α-globin gene cluster [3]. This 
pseudogene has only six nucleotide differences from its 
parent ζ (zeta) gene, and one of these differences leads 
to a nonsense codon. In eight populations studied, the 
nonsense codon is corrected by gene conversion in 15% 
to 50% of α-globin gene clusters. However, RNA emanating 
from the corrected ψζ gene could not be detected 
[3]. 

Although there are many duplicate pseudogenes in the 
human genome, the majority of human pseudogenes, 
more than 7,800 [1], belong to the second class, and are 
called processed pseudogenes (PPs). The term processed 
pseudogene was first proposed in 1977 to describe a 
sequence of a 5S gene of Xenopus laevis [4]. PPs are 
found in the genomes of many animal species [2] and 
have the following characteristics: 1) their sequences are 
very similar to the transcribed portion of the parent 
gene; 2) they lack all or most introns, so they appear to 
be cDNA copies of processed mRNAs; 3) they have a 
poly A tail attached to the 3'-most transcribed nucleo- 
tide; and 4) they are flanked at their 5' and 3' ends by 
target site duplications (TSDs) of 5 to 20 nucleotides. 
The cDNA copies of mRNAs, the source of PPs, are 
inserted in far- flung regions of the genome [5]. At least 
10% of PPs retain activity because when dispersed they 
have fortuitously landed close to an RNA polymerase II 
promoter [2]. We have known for ten years that the sequence 
characteristics of PPs are signs of mobilization 
by the endonuclease and reverse transcriptase activities 
of active LINE-1 (L1) elements [6,7]. In human cells, L1s 
have been shown to mobilize SINEs such as Alus [8,9], 
SVAs [10,11], and small nuclear (sn) RNAs [12], along 
with many mRNA transcripts. In mouse cells, L1s also 
mobilize B1 and B2 SINE elements [13]. More than 
2,075 human genes are represented by at least one PP in 
the genome, while some genes, such as GAPDH, ribosomal 
proteins and actin β, have 50 to 100 PPs [14]. Why 
10% of human genes are represented by PPs, while the 
remaining 90% are not, is an important unanswered 
question. 
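
Two of the sequence hallmarks listed above, the flanking TSDs of 5 to 20 nucleotides and the 3' poly A tail, are simple enough to check programmatically. The following Python sketch is only an illustration under stated assumptions: the function names, the exact-match rule for TSDs and the toy sequences are all hypothetical and are not taken from any of the cited studies. Real TSDs can contain mismatches, which this sketch ignores.

```python
from typing import Optional


def find_tsd(upstream: str, downstream: str,
             min_len: int = 5, max_len: int = 20) -> Optional[str]:
    """Return the longest direct repeat that ends `upstream` and starts `downstream`.

    `upstream` is the genomic sequence immediately 5' of the insertion and
    `downstream` is the sequence immediately 3' of its poly A tail.
    Exact matching is a simplifying assumption of this sketch.
    """
    longest = min(max_len, len(upstream), len(downstream))
    for k in range(longest, min_len - 1, -1):
        if upstream[-k:].upper() == downstream[:k].upper():
            return upstream[-k:].upper()
    return None


def poly_a_tail_length(insert: str) -> int:
    """Count the uninterrupted run of A residues at the 3' end of the insert."""
    seq = insert.upper()
    return len(seq) - len(seq.rstrip("A"))


# Toy sequences, invented for illustration only.
upstream = "CCGGTTAAGAAATGC"
insert = "ATGGCCGTTCC" + "A" * 12
downstream = "AAGAAATGCTTGGCA"

print(find_tsd(upstream, downstream))  # AAGAAATGC
print(poly_a_tail_length(insert))      # 12
```

Searching from the longest allowed repeat downward means that the reported TSD is the longest one consistent with the 5 to 20 nucleotide range given in the text.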

A number of quite interesting PPs have been identi- 
fied. In one example, the phosphoglycerate kinase gene, 
pgk2, is an active testis-expressed PP derived from the 
X-linked pgk1 gene [15]. Deficiency of pgk2 leads to severe 
reduction in male fertility [16]. Another example is 
the fgf4 (fibroblast growth factor 4) PP in a number of 
dog breeds. This activated fgf4 PP is responsible for a 
chondrodysplasia that leads to the short-legged phenotype 
of 19 dog breeds, including dachshund, basset hound 
and corgi [17]. A third example is the CypA pseudogene 
that has inserted into the TRIM5 gene at least twice, 
once in the owl monkey [18] and another time in the 
macaque lineage [19,20]. The TRIM-Cyp fusion gene 
leads to HIV-1 resistance of the monkeys because the 
TRIM-Cyp fusion protein blocks entry of the virus into 
cells [18]. 

There is another class of PPs termed semi-processed 
pseudogenes, which retain some introns and are particu- 
larly prevalent in the mouse and rat. For example, in the 
mouse the preproinsulin II gene has two introns, while 
the preproinsulin I gene is a PP that retains one of the 
two introns [21]. However, until very recently the prevailing 
view has been that there is very little ongoing PP 
formation in mammals. We now know that this view is 
wrong: there is significant PP formation in present-day 
human beings. 

Recent processed pseudogene insertions 

About one year ago, a comprehensive paper on polymorphism 
among PPs in human beings appeared. Ewing 
et al. devised a bioinformatic pipeline to detect polymorphic 
PPs. Using discordant reads to identify insertions not present 
in reference genomes, they found 48 novel PP insertion sites 
among 939 low-pass genomes from the 1000 Genomes 
Project [22]. These PPs came from a wide variety of 
source genes, and were spread throughout the human 
chromosomes (Figure 1). All 48 of these polymorphic 
PPs were confirmed by locating the precise genomic insertion 
site. This group also studied the genome sequences 
of 85 human cancer-normal tissue pairs 
representing a variety of cancers. Among these cancers 
they found the first instances of somatic insertion of 
PPs; three PPs that were absent from paired normal tissue 
were predicted to occur in lung cancers. The authors 
also estimated the rate of PP insertion in human beings 
at approximately one insertion per 5,200 individuals 
per generation [22]. 

Figure 1 Locations of 48 non-reference gene processed pseudogene insertion sites in the human genome based on reads mapped to 
source genes. Discordant read mappings are represented by links colored based on chromosome of the source gene. Insertion sites are 
represented by black circles and the gene labels are based on the position of the source gene. Republished with permission from 
Nature Communications. 
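
The discordant-read signal that underlies this kind of search can be illustrated with a minimal Python sketch. It is not the pipeline of Ewing et al.: the `ReadPair` record, the thresholds (`max_span`, `window`, `min_support`) and the coordinates are hypothetical choices made for illustration. The idea is simply that a read pair with one mate in a source gene and the other mate elsewhere in the genome hints at a copy of that gene inserted at the second location, and that several such pairs landing in the same window make a candidate insertion site.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class ReadPair(NamedTuple):
    chrom1: str  # chromosome of the first mate
    pos1: int    # alignment start of the first mate
    chrom2: str  # chromosome of the second mate
    pos2: int    # alignment start of the second mate


Gene = Tuple[str, int, int]  # (chromosome, start, end)


def is_discordant(pair: ReadPair, max_span: int = 1000) -> bool:
    """Mates on different chromosomes, or further apart than max_span."""
    if pair.chrom1 != pair.chrom2:
        return True
    return abs(pair.pos2 - pair.pos1) > max_span


def in_gene(chrom: str, pos: int, gene: Gene) -> bool:
    g_chrom, g_start, g_end = gene
    return chrom == g_chrom and g_start <= pos <= g_end


def candidate_sites(pairs: Iterable[ReadPair], gene: Gene,
                    window: int = 500, min_support: int = 2
                    ) -> Dict[Tuple[str, int], int]:
    """Count discordant pairs linking `gene` to each genomic window outside it."""
    support: Dict[Tuple[str, int], int] = defaultdict(int)
    for p in pairs:
        if not is_discordant(p):
            continue
        mates = ((p.chrom1, p.pos1), (p.chrom2, p.pos2))
        # Try both orientations: either mate may be the one in the source gene.
        for anchor, partner in (mates, mates[::-1]):
            if in_gene(*anchor, gene) and not in_gene(*partner, gene):
                chrom, pos = partner
                support[(chrom, pos // window * window)] += 1
    return {site: n for site, n in support.items() if n >= min_support}


# Invented coordinates, for illustration only.
source_gene: Gene = ("chr12", 6_500_000, 6_540_000)
pairs = [
    ReadPair("chr12", 6_501_200, "chr3", 10_100),      # links gene to chr3
    ReadPair("chr3", 10_350, "chr12", 6_520_900),      # same window on chr3
    ReadPair("chr12", 6_501_000, "chr12", 6_501_300),  # concordant, ignored
    ReadPair("chr12", 6_510_000, "chr7", 55_000),      # only one supporting pair
]
print(candidate_sites(pairs, source_gene))  # {('chr3', 10000): 2}
```

Requiring at least two supporting pairs per window is an arbitrary noise filter in this sketch. As the text notes, candidate sites still need to be confirmed by locating the precise genomic insertion site.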

Ewing et al. went on to study PP polymorphism 
among mice, finding 755 new polymorphic PPs with 
most PPs occurring in species and subspecies derived 
from wild mice. Among these, Mus musculus castaneus, 
M.m. musculus, and M.m. spretus had 213, 212 and 142 
PPs in their genomes, respectively, that were not found 
in the inbred C57BL/6 genome. By contrast, the 12 inbred 
strains derived from C57BL/6 were genetically closer, 
but still differed from one another by 68 
PPs on average. The much greater number of polymorphic 
PPs in mouse strains compared to individual 
human beings may be due to the much larger number of 
active L1s present in the mouse (approximately 3,000 
versus approximately 100 in humans) [23,24]. Ewing et 
al. also studied the genome sequences of ten chimpan- 
zees and found ten polymorphic PPs among these ani- 
mals. This paper represented the first comprehensive 
look at the question of PP insertions in humans, mice 
and chimpanzees, and the first study of somatic inser- 
tion of PPs in cancer. 

Two other papers demonstrating polymorphism of PPs 
in humans have now appeared. Using exon-exon junc- 
tion spanning reads, Abyzov et al. found 147 novel puta- 
tive processed pseudogenes among approximately 1,000 
low-pass genome sequences [25]. Thirty-six of these 
147 were confirmed as polymorphic in humans by detec- 
tion of the genomic insertion point. Interestingly, the 
parental genes of non-reference PPs were significantly 
enriched among genes expressed at the M-to-G1 transition 
in the cell cycle. Schrider et al. also mapped processed 
pseudogenes among 17 individuals, mostly using 
exon-exon junction spanning reads from SOLiD and 
1000 Genomes data [26]. They found 21 PPs not present 
in the reference genome and presumably polymorphic; 
17 of these 21 were confirmed by PCR (see [27] for a recent 
review of these papers). 
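
The logic behind exon-exon junction spanning reads is that a genomic DNA read from the parent gene runs from an exon into the adjacent intron, whereas a read from a processed copy runs directly from one exon into the next. A minimal Python sketch of that idea follows. It is not the method of Abyzov et al. or Schrider et al.: the helper names, the `flank` length, the exact substring matching and the toy sequences are assumptions made for illustration, and real analyses work from read alignments rather than string search.

```python
from typing import List


def junction_probes(exons: List[str], flank: int = 10) -> List[str]:
    """Build one probe per exon-exon junction from `flank` bases on each side."""
    return [(left[-flank:] + right[:flank]).upper()
            for left, right in zip(exons, exons[1:])]


def count_junction_reads(reads: List[str], exons: List[str],
                         flank: int = 10) -> List[int]:
    """For each junction, count genomic reads that contain its probe exactly."""
    probes = junction_probes(exons, flank)
    return [sum(probe in read.upper() for read in reads) for probe in probes]


# Toy exon sequences, invented for illustration only.
exons = ["ATGGCGTACGTTAGC", "GGATCCTTGACAGTA", "CCGTAAGCTTGGTGA"]

reads = [
    exons[0][-12:] + exons[1][:12],   # spans junction 1: intron is missing
    exons[0][-12:] + "GTAAGTCCTTAA",  # exon 1 running into a made-up intron
    exons[1][-11:] + exons[2][:11],   # spans junction 2: intron is missing
]

print(count_junction_reads(reads, exons))  # [1, 1]
```

In this toy case the second read, which continues from exon 1 into intron-like sequence, supports the parent gene and is not counted, while the first and third reads are the kind of evidence that suggests an intronless copy somewhere in the genome.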

Recently, Cooke et al. studied somatic PP insertion in 
cancer in greater detail [28]. They analyzed 660 cancer- 
normal pairs of sequenced samples at Wellcome Trust 
representing a variety of different cancers. In 17 (2.5%) 
of the cancers, they found 42 somatic PPs. The authors 
noted the presence of five PPs in non-small cell lung 
cancer among 27 cancers studied, similar to the Ewing 
et al. finding of somatic PPs in lung cancer. Additionally, 
they found two PPs in eleven colorectal cancer samples. 

The PP insertions in cancer were thoroughly characterized 
and all had the molecular signatures of germ line 
L1 insertions. The majority had TSDs of 5 to 20 base 
pairs, 74% were 5' truncated (a percentage similar to that 
of human-specific L1s), 20% had inversions at their 5' 
ends due to 'twin priming' (again similar to the rate in 
germ line human L1 insertions) [29], and the insertions 
carried long poly A tracts. In a lung adenocarcinoma, one 
insertion was associated with an 8 kb deletion of the promoter and exon 
1 of a tumor suppressor gene, MGA1. The deletion 
knocked out expression of that allele as determined by 
RNA-seq. 

Among the PPs in cancer, most were derived from 
highly expressed transcripts, yet many were not. In 
addition, many PP insertions appeared to be early events 
in tumor formation, being present in an early lesion 
along with the tumor or in multiple sections of the same 
tumor. However, some PP insertions were shown to be 
later events in tumor progression because they were not 
detected in all sections of the same tumor. 

A final paper nailed down the potential for PP forma- 
tion during early development in humans. This paper by 
de Boer et al. described a case of the X-linked disorder, 
chronic granulomatous disease, in a Dutch man [30]. 
This man, now a young adult, had suffered from mul- 
tiple bouts of pulmonary aspergillosis as a child. On 
workup of his CYBB (cytochrome b-245, beta polypep- 
tide) gene, the defective gene in the disorder and paren- 
thetically the first human gene cloned by positional 
cloning [31], it was discovered that a PP insertion had 
knocked out the gene's activity. 

There are three interesting aspects of this case. First, 
the insertion was a semi-processed pseudogene of the 
TMF1 (TATA element modulatory factor) gene from 
chromosome 3 that had inserted into intron 1 of CYBB 
in reverse orientation. A PP had not previously been 
observed among the 100 new insertions 
(L1, Alu, SVA) reported in human Mendelian disease or cancer 
etiology [32]. Interestingly, TMF1 is one of the roughly 
10% of human genes that are represented by a single PP 
in the human reference genome sequence [14]. Second, 
the insertion was 3' truncated and contained exons 1 to 
8 of TMF1 along with intron 7 and much of intron 8. 
Transcription of TMF1 had terminated after an alterna- 
tive poly A signal, AGUAAA, in intron 8, and a 100 bp 
poly A tail was added to the transcript. After insertion of 
this semi-processed pseudogene in reverse orientation 
into intron 1 of CYBB, splicing had occurred into an ex- 
cellent acceptor splice site and out of an excellent donor 
site in exon 2 of TMF1. The newly created 117 bp exon 
also contained a nonsense codon that caused the CYBB 
gene to be non-functional (Figure 2). Finally, the PP in- 
sertion had occurred during early embryonic develop- 
ment of the patient's mother. Roughly 10% to 20% of her 
lymphocytes contained the insertion as shown by qPCR. 

Figure 2 Orientation of the TMF1 insertion in intron 1 of the CYBB gene (below), leading to an extra exon between exons 1 and 2 in 
the CYBB mRNA (above). Republished with permission from Human Mutation published by Wiley. 

To date, somatic retrotransposition in Mendelian disease 
has rarely been found. Among the 100 cases mentioned 
above, there are only two: a somatic insertion into the 
adenomatous polyposis coli (APC) tumor suppressor 
gene in a colorectal cancer case [33], and somatic and 
germ line mosaicism in the mother of a patient with the 
X-linked disease, choroideremia [34]. Thus, more 
than 20 years after the discovery of the first retrotransposition 
events due to L1 and Alu elements [35,36], we 
finally have definitive evidence of retrotransposition of 
processed pseudogenes in human somatic cells (cancer 
and early development). 

These papers raise the question: why do PP insertions 
not occur more frequently? Another recent paper has 
provided evidence that the RNAs associated with the L1 
ORF1 protein in the L1 ribonucleoprotein particle (L1 
RNP) contain a preponderance of those mRNAs that 
form PPs [37]. These mRNAs also have a much greater 
capacity for reverse transcription by L1 ORF2 protein 
than mRNAs that do not form PPs [37,38]. Now that we 
know that PP formation can occur in somatic cells, it is 
logical that those mRNAs that are both located in L1 
RNPs and capable of reverse transcription have the inside 
track in PP formation. Messenger RNAs that lack 
what it takes to associate with the L1 RNP and be reverse 
transcribed, perhaps due to deficient cellular concentration 
or their sequence characteristics, are unable 
to form PPs. However, the story is not quite so simple 
since the majority of mRNAs that have formed PPs in 
the human genome do not appear to be associated with 
the L1 RNP. Thus, the demonstration of somatic PP insertions 
leads to a new, as yet unanswered question: 
What are the important factors that increase the likelihood 
that a particular mRNA will become a processed 
pseudogene? 

Conclusions 

Although perhaps unexpected, the evidence is over- 
whelming that PPs continue to insert in the germ line 
and in somatic cells of human beings. 
RNP: ribonucleoprotein particle. 